class: center, middle, inverse, title-slide

.title[
# ISA 401: Business Intelligence & Data Visualization
]
.subtitle[
## 02: A Quick Introduction to R, Python, and LLMs
]
.author[
###
Fadel M. Megahed, PhD
Endres Associate Professor
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Fall 2023 ] --- # Quick Refresher from Last Class <img src="data:image/png;base64,#../../figures/ideogram_taylor_swift.jpeg" width="45%" style="display: block; margin: auto;" /> .footnote[ <html> <hr> </html> **Note:** Image generated using <https://ideogram.ai/>. **Pros:** Currently best text-2-image generator while embedding text, and free (useful for professional logos). **Cons:** Not the best with hands. ] --- # Quick Refresher from Last Class ✅ Describe course objectives & structure ✅ Define data visualization & describe its main goals ✅ Describe the BI methodology and major concepts --- # Learning Objectives for Today's Class - Describe how and why we use scripted languages in this course. - Utilize the project workflow in RStudio (we will try to use that as an IDE for
R and Python
). - Understand the syntax, data structures and functions in both
R and Python
. - Understand the potential impact of LLMs on businesses and explore how they can be leveraged in the context of this class. --- class: inverse, center, middle # Scripted Languages (i.e.,
R and Python
) --- # Pedagogy Behind Using Scripted Languages .center[###
R ]

```r
crashes =
  # reading the data directly from the source
  readr::read_csv("https://data.cincinnati-oh.gov/api/views/rvmt-pkmq/rows.csv?accessType=DOWNLOAD") |>
  # changing all variable names to snake_case
  janitor::clean_names() |>
  # selecting variables of interest
  dplyr::select(address_x, latitude_x, longitude_x, cpd_neighborhood,
                datecrashreported, instanceid, typeofperson, weather) |>
  # engineering some features from the data
  dplyr::mutate(
    datetime = lubridate::parse_date_time(datecrashreported,
                                          orders = "%m/%d/%Y %I:%M:%S %p",
                                          tz = 'America/New_York'),
    hour = lubridate::hour(datetime),
    date = lubridate::as_date(datetime)
  )
```

---

# Pedagogy Behind Using Scripted Languages

.center[###
Python ]

```python
import pandas as pd

crashes = (
    # Reading the CSV file from a URL
    pd.read_csv('https://data.cincinnati-oh.gov/api/views/rvmt-pkmq/rows.csv?accessType=DOWNLOAD')
    # Renaming all columns to be lowercase and replacing spaces with underscores
    # (passing a function avoids downloading the file a second time just to list its columns)
    .rename(columns=lambda col: col.lower().replace(" ", "_"))
    # Selecting only the columns of interest
    .loc[:, ["address_x", "latitude_x", "longitude_x", "cpd_neighborhood",
             "datecrashreported", "instanceid", "typeofperson", "weather"]]
    # Adding new columns: 'datetime', 'hour', and 'date' derived from 'datecrashreported'
    .assign(
        datetime=lambda df: pd.to_datetime(df['datecrashreported'], format="%m/%d/%Y %I:%M:%S %p"),  # Convert 'datecrashreported' to datetime format
        hour=lambda df: df['datetime'].dt.hour,  # Extracting the hour from 'datetime'
        date=lambda df: df['datetime'].dt.date  # Extracting the date from 'datetime'
    )
)
```

???

Here's what each major function does:

1. pd.read_csv(...): Reads the CSV file from the given URL and converts it into a Pandas DataFrame.
2. .rename(...): Renames all columns in the DataFrame by applying the supplied function to each column name. In this case, it makes all column names lowercase and replaces spaces with underscores.
3. .loc[..., [...]]: Selects only the columns specified in the list, effectively filtering out the rest.
4. .assign(...): Adds new columns to the DataFrame based on some operation or transformation. In this case, it adds the datetime, hour, and date columns.

---

# The Beauty of Programming Languages

- Programming languages are **languages**.
- **It's just text** -- which gives you access to **two extremely powerful techniques**!!!
  + .font100[`Ctrl` + `C`
] + .font100[`Ctrl` + `V`
] - In addition, programming languages are generally: + Readable (IMO way easier than trying to figure what someone did in an
) + Open (so it is easier to
Google or **ChatGPT** it)
  + Reusable and reproducible (so you can reuse your code for similar problems and other people can get the same results as you easily)
  + Diffable (version control is extremely powerful)

.footnote[
<html>
<hr>
</html>

**Source:** Content in "The Beauty of Programming Languages" is from [Hadley Wickham's You Can't Do Data Science in a GUI](https://speakerdeck.com/hadley/you-cant-do-data-science-in-a-gui?slide=14)
]

---

# How to Learn Any Programming Language

<html>
<center>
<iframe src="https://giphy.com/embed/xonOzxf2M8hNu" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p>
</center>
</html>

* 🗣 **Get your hands dirty** ‼️
* 📖 Documentation! Documentation! Documentation!
* 🔎 (Not surprisingly) Learn to Google/ChatGPT what an error message means (I do that a lot 😄)

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's How I Learn a Technology](https://stats220.earo.me/01-intro.html#7).
]

---

class: inverse, center, middle

# The RStudio Interface, Setup and a Project-Oriented Workflow for your Analysis

---

## RStudio Interface

.center[<img src="../../figures/rstudio-interface.png" width="80%">]

.footnote[
<html>
<hr>
</html>

image credit: Stuart Lee]

???

live

---

## Setting up RStudio (do this once for
R)

.pull-left[
Go to **Tools** > **Global Options**:

.center[<img src="../../figures/rstudio-setup.PNG" width="100%">]
]

.pull-right[
<br>
<br>
<br>
<br>
Uncheck `Workspace` and `History`, which helps to keep
the R working environment fresh and clean every time you switch between projects.
]

---

## What is a project?

* Treat each university course as a project to keep your work organised.
* A self-contained project is a folder that contains all relevant files, for example my `ISA 401/` 📂 includes:
  + `isa401.Rproj`
  + `lectures/`
      + `01_Introduction/`
          * `01-Introduction.Rmd`, etc.
      + `02_llms_r_python/`
          * `02_llms_r_python.Rmd`, etc.
* All working files are **relative** to the **project root** (i.e. `isa401/`).
* The project should just work on a different computer (in most cases).

---

## Setting up RStudio (do this once for
Python)

.pull-left-3[
.font70[
.center[**Installing Python**]

1. (Preferred) Install
Python via `reticulate::install_miniconda()`.
2. Install
Python from [python.org](https://www.python.org/downloads/).
3. Set up a virtual environment and install needed
packages.

.center[**Helping R Find the Correct Version of Python**]

1. Edit the `.Rprofile` for the project to connect to a specific version of Python, .font70[`Sys.setenv(RETICULATE_PYTHON = "C:\\tools\\Anaconda3\\envs\\spc_gpt\\python.exe")`]
2. Configure a default version of Python to be used with RStudio via Tools -> Global Options... -> Python
]
]

.pull-right-3[
<iframe src="python_with_r_studio.html" width="100%" height="470px" data-external="1"></iframe>
]

.footnote[
<html>
<hr>
</html>

**Note:** Feel free to use other IDEs of your choice for Python and/or R. During class, I will use RStudio for both to reduce the possible friction from using and/or setting up multiple IDEs.
]

---

class: inverse, center, middle

# Operators 101

---

# Assignment

.pull-left[
###
R ]

```r
x1 <- 5
x2 = 5
5 -> x3
print(paste0("The values of x1, x2, and x3 are ", x1, ", ", x2, ", and ", x3, " respectively"))
```

```
## [1] "The values of x1, x2, and x3 are 5, 5, and 5 respectively"
```
]

.pull-right[
###
Python ]

```python
x1 = 5 # no '<-' operator
x2 = 5 # only "="
x3 = 5 # no '->' operator
print(f"The values of x1, x2, and x3 are {x1}, {x2}, and {x3} respectively")
```

```
## The values of x1, x2, and x3 are 5, 5, and 5 respectively
```
]

The assignment consists of three parts:

- The left-hand side: **variable names** (`x1` or `x2`),
- The assignment operator: `=` (R also accepts `<-`), and
- The right-hand side: **values** (`5`).

---

# Retrieval

We can retrieve/call the object using its name as follows:

```r
x1
```

```
## [1] 5
```

```r
x3
```

```
## [1] 5
```

---

# Retrieval: Three Common Errors

**Case issue:** object names in
R and Python
are **case sensitive**. ```r X1 # should be x1 instead of X1 (see last slide) ``` ``` ## Error in eval(expr, envir, enclos): object 'X1' not found ``` -- **Typo:** A spelling error of some sort (with a corresponding
Python error message)

```python
y3 # should be x3 instead of y3 (see last slide)
```

```
## name 'y3' is not defined
```

--

**Object not saved:** e.g., you clicked **Enter** instead of **Ctrl + Enter** when running your code

```r
rm(x2) # removing x2 from the global environment to mimic error
x2 # x2 is not in the global environment (see environment)
```

```
## Error in eval(expr, envir, enclos): object 'x2' not found
```

---

# Arithmetic Operators

While we will not specifically talk about doing math in this course, the operators below are good to know.

<html>
<center>
<table border="1">
  <tr> <th>R</th> <th>Description</th> <th>Python</th> </tr>
  <tr> <td>+</td> <td>addition</td> <td>+</td> </tr>
  <tr> <td>-</td> <td>subtraction</td> <td>-</td> </tr>
  <tr> <td>*</td> <td>multiplication</td> <td>*</td> </tr>
  <tr> <td>/</td> <td>division</td> <td>/</td> </tr>
  <tr> <td>^ or **</td> <td>exponentiation</td> <td>**</td> </tr>
  <tr> <td>x %% y</td> <td>modulus (x mod y)</td> <td>x % y</td> </tr>
  <tr> <td>x %/% y</td> <td>integer division</td> <td>x // y</td> </tr>
</table>
</center>
</html>

---

# Logical Operators

Logical operators are operators that return `TRUE` (`True`
in Python) and `FALSE` (`False` in
Python) values.

<table border="1">
  <tr> <th>R</th> <th>Description</th> <th>Python</th> </tr>
  <tr> <td>&lt;</td> <td>less than</td> <td>&lt;</td> </tr>
  <tr> <td>&lt;=</td> <td>less than or equal to</td> <td>&lt;=</td> </tr>
  <tr> <td>&gt;</td> <td>greater than</td> <td>&gt;</td> </tr>
  <tr> <td>&gt;=</td> <td>greater than or equal to</td> <td>&gt;=</td> </tr>
  <tr> <td>==</td> <td>exactly equal to</td> <td>==</td> </tr>
  <tr> <td>!=</td> <td>not equal to</td> <td>!=</td> </tr>
  <tr> <td>!x</td> <td>not x</td> <td>not x</td> </tr>
  <tr> <td>x &amp; y</td> <td>x AND y</td> <td>x and y (x &amp; y element-wise in pandas/NumPy)</td> </tr>
  <tr> <td>isTRUE(x)</td> <td>test if x is TRUE</td> <td>x is True</td> </tr>
</table>

---

class: inverse, center, middle

# 101: Syntax, Data Types, Data Structures and Functions

---

# Coding Style

> .font150[Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. <br> -- [The tidyverse style guide](https://style.tidyverse.org)]

.pull-left[
###
R style guide

✅ `snake_case`
]

.pull-right[
###
Python style guide

✅ `snake_case` for variables and functions, `PascalCase` for class names
]

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#34)
]

---

#
R is a Vector Language: Types and Attributes

.pull-left[
- **Vectors** come in **two flavors**, which differ by their **elements' types:**
  * **atomic vectors --** all elements **must have the same type**
  * **lists --** elements **can** be different
- Vectors have two important **attributes:**
  * **Dimension** turns vectors into matrices and arrays, checked using `dim(object_name)`.
  * The **class** attribute powers the S3 object system, checked using `class(object_name)`.
]

.pull-right[
.center[<img src="https://d33wubrfki0l68.cloudfront.net/2ff3a6cebf1bb80abb2a814ae1cfc67b12817713/ae848/diagrams/vectors/summary-tree.png" width="85%">]

```r
x_vec = rnorm(n=10, mean = 0, sd = 1)
class(x_vec)
```

```
## [1] "numeric"
```
]

.footnote[
<html>
<hr>
</html>

**Source:** The content and image are from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html)
]

---

#
R is a Vector Language: Atomic Vectors

.pull-left[
.center[<img src="https://d33wubrfki0l68.cloudfront.net/eb6730b841e32292d9ff36b33a590e24b6221f43/57192/diagrams/vectors/summary-tree-atomic.png" width="100%">]
]

.pull-right[
```r
dim(x_vec)
```

```
## NULL
```

**Atomic vectors have a dim of NULL, which distinguishes them from 1D arrays 😲!!!**
]

.footnote[
<html>
<hr>
</html>

**Source:** The image is from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html)
]

---

#
R Data Types: A Visual Introduction [1]

.center[<img src="https://d33wubrfki0l68.cloudfront.net/8a3d360c80da1186b1373a0ff0ddf7803b96e20d/254c6/diagrams/vectors/atomic.png" width="60%">]

- To check the **type of** an object in
R, you can use the function `typeof`:

```r
typeof(x_vec)
```

```
## [1] "double"
```

.footnote[
<html>
<hr>
</html>

**Source:** The image is from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html)
]

---

#
R Data Types: A Visual Introduction [2]

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/four_popular_data_types.png" alt="The four data types that we will utilize the most in our course." width="100%" />
<p class="caption">The four data types that we will utilize the most in our course.</p>
</div>

---

#
R Data Types: A Visual Introduction [3]

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/legos-jbryan-types.png" alt="A visual representation of different types of atomic vectors" width="100%" />
<p class="caption">A visual representation of different types of atomic vectors</p>
</div>

.footnote[
<html>
<hr>
</html>

**Source:** The images are from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats#readme)
]

---

#
R Data Types: Formal Definitions

Each of the four primary types has a special syntax to create an individual value:

- Logicals can be written in full (`TRUE` or `FALSE`), or abbreviated (`T` or `F`).
- Doubles can be specified in decimal (`0.1234`), scientific (`1.23e4`), or hexadecimal (`0xcafe`) form.
  * There are three special values unique to doubles: `Inf`, `-Inf`, and `NaN` (not a number).
  * These are special values defined by the floating point standard.
- Integers are written similarly to doubles but must be followed by `L` (`1234L`, `1e4L`, or `0xcafeL`), and can not contain fractional values.
- Strings are surrounded by `"` (e.g., `"hi"`) or `'` (e.g., `'bye'`). Special characters are escaped with `\`; see `?Quotes` for full details.

.footnote[
<html>
<hr>
</html>

**Source:** The content of this slide is verbatim from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html#scalars)
]

---

# Translating
R Data Types to Python
<html> <center> <table border="1"> <tr> <th>R Data Type</th> <th>Description</th> <th>Python Equivalent</th> </tr> <tr> <td>numeric</td> <td>Decimal numbers</td> <td>float</td> </tr> <tr> <td>integer</td> <td>Whole numbers</td> <td>int</td> </tr> <tr> <td>character</td> <td>Text or strings</td> <td>str</td> </tr> <tr> <td>factor</td> <td>Categorical data</td> <td>pandas.Categorical or str</td> </tr> <tr> <td>Date</td> <td>Date values</td> <td>datetime.date</td> </tr> <tr> <td>POSIXct</td> <td>Date and time</td> <td>datetime.datetime</td> </tr> <tr> <td>logical</td> <td>Boolean (TRUE/FALSE)</td> <td>bool</td> </tr> <tr> <td>complex</td> <td>Complex numbers</td> <td>complex</td> </tr> </table> </center> </html> --- #
R Data Structures: Atomic Vector (1D)

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/legos-jbryan-types.png" alt="A visual representation of different types of atomic vectors" width="100%" />
<p class="caption">Keeping the visual representation of different types of atomic vectors in your head!!</p>
</div>

```r
dept = c('ACC', 'ECO', 'FIN', 'ISA', 'MGMT')
nfaculty = c(18L, 19L, 14L, 25L, 22L)
```

.footnote[
<html>
<hr>
</html>

**Source:** The images are from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats#readme)
]

---

#
R Data Structures: 1D ➡️ 2D [Visually]

.center[<img src="../../figures/legos-jbryan-structures.png" width="92%">]

.footnote[
<html>
<hr>
</html>

**Source:** The images are from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats#readme)
]

---

#
R Data Structures: 1D ➡️ 2D [In Code]

```r
library(tibble)
fsb_tbl <- tibble(
  department = dept,
  count = nfaculty,
  percentage = count / sum(count))
fsb_tbl
```

```
## # A tibble: 5 × 3
##   department count percentage
##   <chr>      <int>      <dbl>
## 1 ACC           18      0.184
## 2 ECO           19      0.194
## 3 FIN           14      0.143
## 4 ISA           25      0.255
## 5 MGMT          22      0.224
```

---

#
R Data Structures: Lists [1]

A list is an object that can contain elements of **different data types**.

.center[<img src="../../figures/legos-jbryan-list.png" width="25%">]

.footnote[
<html>
<hr>
</html>

**Source:** The image is adapted from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats/blob/master/lego-rstats_014.jpg)
]

---

#
R Data Structures: Lists [2]

.center[<img src="https://d33wubrfki0l68.cloudfront.net/9628eed602df6fd55d9bced4fba0a5a85d93db8a/36c16/diagrams/vectors/list.png" width="100%">]

```r
lst <- list( # list constructor/creator
* 1:3, # atomic integer vector of length = 3
* "a", # atomic character vector of length = 1 (aka scalar)
* c(TRUE, FALSE, TRUE), # atomic logical vector of length = 3
* c(2.3, 5.9) # atomic double/numeric vector of length = 2
)
lst # printing the list
```

```
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE
## 
## [[4]]
## [1] 2.3 5.9
```

.footnote[
<html>
<hr>
</html>

**Source:** Image is from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html#lists)
]

---

#
R Data Structures: Lists [3]

.pull-left[
### data type

```r
typeof(lst) # primitive type
```

```
## [1] "list"
```

### data class

```r
class(lst) # type + attributes
```

```
## [1] "list"
```
]

.pull-right[
### data structure

```r
str(lst) # sublists can be of diff lengths and types
```

```
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9
```
]

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/02-import-export.html#6).
]

---

#
R Data Structures: Lists [3]

A list can contain other lists, i.e. it is **recursive**.

```r
# a named list
str(
* list(first_el = lst, second_el = iris)
)
```

```
## List of 2
##  $ first_el :List of 4
##   ..$ : int [1:3] 1 2 3
##   ..$ : chr "a"
##   ..$ : logi [1:3] TRUE FALSE TRUE
##   ..$ : num [1:2] 2.3 5.9
##  $ second_el:'data.frame': 150 obs. of 5 variables:
##   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##   ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

---

#
R Data Structures: Lists [4]

.pull-left[
Subset by `[]`

```r
lst[1]
```

```
## [[1]]
## [1] 1 2 3
```
]

.pull-right[
Subset by `[[]]`

```r
lst[[1]]
```

```
## [1] 1 2 3
```
]

.center[<img src="../../figures/pepper.png" width="50%">]

.footnote[
<html>
<hr>
</html>

**Sources:** The slide is based on [Earo Wang's STAT 220 slides](https://stats220.earo.me/02-import-export.html#10) and image is from [Hadley Wickham's Tweet on Indexing lists in R](https://twitter.com/hadleywickham/status/643381054758363136?lang=en).
]

---

#
R Data Structures: Matrices

A matrix is a **2D data structure** made of **one/homogeneous data type.**

.pull-left[
```r
x_mat = matrix(
  sample(1:10, size = 4), nrow = 2, ncol = 2
)
str(x_mat) # its structure?
```

```
##  int [1:2, 1:2] 2 3 10 6
```

```r
x_mat # printing it nicely
print('-----------------')
*x_mat[1, 2] # subsetting
```

```
##      [,1] [,2]
## [1,]    2   10
## [2,]    3    6
## [1] "-----------------"
## [1] 10
```
]

--

.pull-right[
```r
x_char = matrix(
  sample(letters, size = 12), nrow = 3, ncol = 4)
x_char
```

```
##      [,1] [,2] [,3] [,4]
## [1,] "t"  "r"  "l"  "q"
## [2,] "f"  "v"  "o"  "a"
## [3,] "u"  "n"  "h"  "d"
```

```r
*x_char[1:2, 2:3] # subsetting
```

```
##      [,1] [,2]
## [1,] "r"  "l"
## [2,] "v"  "o"
```
]

---

#
R Data Structures: Data Frames [1]

.center[<img src="https://d33wubrfki0l68.cloudfront.net/9ec5e1f8982238a413847eb5c9bbc5dcf44c9893/bc590/diagrams/vectors/summary-tree-s3-2.png" width="22%">]

> .font150[If you do data analysis in R, you’re going to be using data frames. A data frame is a named list of vectors with attributes for `(column)` `names`, `row.names`, and its class, “data.frame”. -- [Hadley Wickham](https://adv-r.hadley.nz/vectors-chap.html#list-array)]

.footnote[
<html>
<hr>
</html>

**Source:** Image is from [Hadley Wickham's Advanced R: Chapter 3 on Vectors](https://adv-r.hadley.nz/vectors-chap.html#list-array)
]

---

#
R Data Structures: Data Frames [2]

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])
typeof(df1) # showing that it's a special case of a list
```

```
## [1] "list"
```

```r
attributes(df1) # but also is of class data.frame
```

```
## $names
## [1] "x" "y"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3
```

In contrast to a regular list, a data frame has **an additional constraint: the length of each of its vectors must be the same.** This gives data frames their **rectangular structure.**

---

#
R Data Structures: Data Frames [3]

As noted in the creation of `df1`, columns in a data frame can be of different types. Hence, data frames are more widely used in data analysis than matrices.

.center[<img src="../../figures/legos-jbryan-dataframe-w-text.png" width="40%">]

.footnote[
<html>
<hr>
</html>

**Source:** The image is adapted from the excellent [lego-rstats GitHub Repository by Jenny Bryan](https://github.com/jennybc/lego-rstats/blob/master/lego-rstats_014.jpg)
]

---

#
R Data Structures: So What is a Tibble?

> Tibble is a **modern reimagining of the data frame**. Tibbles are designed to be (as much as possible) **drop-in replacements for data frames** that fix those frustrations. A concise, and fun, way to summarise the main differences is that tibbles are **lazy and surly: they do less and complain more**. -- [Hadley Wickham](https://adv-r.hadley.nz/vectors-chap.html#list-array)

.pull-left[[<img src="https://d33wubrfki0l68.cloudfront.net/565916198b0be51bf88b36f94b80c7ea67cafe7c/7f70b/cover.png" height="320px">](https://adv-r.hadley.nz)]

To learn more about the basics of tibble, please consult the reference below:

* [Data frames and tibbles (Click and read from 3.6 up to and including 3.6.5)](https://adv-r.hadley.nz/vectors-chap.html#list-array)

---

# Translating
R Data Structures to Python
<html> <center> <table border="1" style="font-size: 0.8em;"> <tr> <th>R Data Structure</th> <th>Description</th> <th>R Subsetting<br>(Multiple Methods)</th> <th>Python Equivalent</th> <th>Python Subsetting<br>(Multiple Methods)</th> </tr> <tr> <td>Vector</td> <td>1D array, single type</td> <td>vector[index]<br>vector[c(1,2)]<br>vector[-1]</td> <td>List (with single type)<br>or NumPy array</td> <td>list[-1]<br>list[1:3]<br>array[1:3]</td> </tr> <tr> <td>Matrix</td> <td>2D array, single type</td> <td>matrix[row, col]<br>matrix[1,]<br>matrix[,1]</td> <td>2D List (with single type)<br>or 2D NumPy array</td> <td>list[row][col]<br>array[row, col]<br>array[row,:]<br>array[:,col]</td> </tr> <tr> <td>Data Frame</td> <td>2D table,<br>multiple types</td> <td>df[row, col]<br>df[1,]<br>df[, "col"]<br>df$col</td> <td>Pandas DataFrame</td> <td>df.loc[row, col]<br>df.iloc[row, col]</td> </tr> <tr> <td>List</td> <td>Ordered collection,<br>multiple types</td> <td>list[[index]]<br>list$element_name<br>list[[1]][1]</td> <td>List</td> <td>list[index]<br>list[index][subindex]</td> </tr> <tr> <td>Dictionary</td> <td>Key-value pairs</td> <td>list$element_name</td> <td>Dictionary</td> <td>dict[key]<br>dict.get(key)</td> </tr> </table> </center> </html> --- #
R Functions

A function call consists of the **function name** followed by one or more **arguments** within parentheses.

```r
temp_high_forecast = c(86, 84, 85, 89, 89, 84, 81)
mean(x = temp_high_forecast)
```

```
## [1] 85.42857
```

* function name: `mean()`, a built-in R function to compute the mean of a vector
* argument: the first argument (LHS `x`) to specify the data (RHS `temp_high_forecast`)

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#41)
]

---

#
R Function Help Page

Check the function's help page with `?mean`

### Class Activity

> _Please take 2 minutes to investigate the help page for `mean` in RStudio._

```r
mean(x = temp_high_forecast, trim = 0, na.rm = FALSE, ...)
```

* Read **Usage** section
  + Which arguments have default values?
* Read **Arguments** section
  + What does `trim` do?
* Run **Example** code
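
For those coming from Python, here is a rough sketch of the idea behind `trim`: drop a fraction of the observations from each end of the sorted data before averaging. The helper name `trimmed_mean` is hypothetical (it is not part of base R or pandas), and the sketch assumes a trim fraction below 0.5; treat it as an illustration rather than a faithful port of R's `mean()`.

```python
import math


def trimmed_mean(x, trim=0.0):
    """Hypothetical helper: average x after dropping a fraction `trim` from each end."""
    if not 0 <= trim < 0.5:
        # Simplifying assumption: only trim fractions in [0, 0.5) are handled here
        raise ValueError("trim must be in [0, 0.5)")
    values = sorted(x)
    k = math.floor(len(values) * trim)  # number of observations dropped from each end
    kept = values[k:len(values) - k]
    return sum(kept) / len(kept)


temp_high_forecast = [86, 84, 85, 89, 89, 84, 81]
plain = trimmed_mean(temp_high_forecast)         # no trimming: the ordinary mean
trimmed = trimmed_mean(temp_high_forecast, 0.2)  # drops the lowest and highest value
print(plain, trimmed)
```

With seven observations, any trim fraction below 1/7 drops nothing, so a call such as `trimmed_mean(temp_high_forecast, 0.1)` gives the same value as the untrimmed mean.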
.footnote[ <html> <hr> </html> **Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#42) ] --- #
R Function Arguments

.pull-left[
### Match by **positions**

```r
mean(temp_high_forecast, 0.1, TRUE)
```

```
## [1] 85.42857
```
]

.pull-right[
### Match by **names**

```r
mean(x = temp_high_forecast, trim = 0.1, na.rm = TRUE)
```

```
## [1] 85.42857
```
]

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#43)
]

---

# Use Functions from Packages

.pull-left[
```r
library(dplyr)
cummean(temp_high_forecast)
```

```
## [1] 86.00000 85.00000 85.00000 86.00000 86.60000 86.16667 85.42857
```

```r
first(temp_high_forecast)
```

```
## [1] 86
```

```r
last(temp_high_forecast)
```

```
## [1] 81
```
]

.pull-right[
<br>
<br>
<br>
<br>
.center[
<img src="https://hbctraining.github.io/Intro-to-R-flipped/img/install_vs_library.jpeg" height="240px">
]
]

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#44)
]

---

# Write Your Own
R Functions

```r
# function_name <- function(arguments) {
#   function_body
# }
my_mean <- function(x, na.rm = FALSE) {
  # drop missing values first so that length(x) counts only the values being summed
  if (na.rm) x <- x[!is.na(x)]
  summation <- sum(x)
  summation / length(x)
}
my_mean(temp_high_forecast)
```

```
## [1] 85.42857
```

.footnote[
<html>
<hr>
</html>

**Source:** Slide is based on [Earo Wang's STAT 220 Slides](https://stats220.earo.me/01-intro.html#45)
]

---

# Write Your Own
Python Functions

```python
# Translating the R function to Python
def my_mean(x, na_rm=False):
    """
    Calculate the mean of a list, with an option to ignore missing (None) values.

    Parameters:
    x (list): List of numbers
    na_rm (bool): Whether to remove None values before calculating the mean

    Returns:
    float: mean of the list
    """
    if na_rm:
        x = [i for i in x if i is not None]
    summation = sum(x)
    return summation / len(x)

# Test the function with a list containing None values
temp_high_forecast = [86, 84, 85, None, 89, 84, 81]
my_mean(temp_high_forecast, na_rm=True)
```

```
## 84.83333333333333
```

---

class: inverse, center, middle

# Generative AI: Large Language Models

---

# Background: Artificial Intelligence

.pull-left[
.center[.bold[A [working definition](https://www.brookings.edu/articles/what-is-artificial-intelligence/) for AI]]

.content-box-gray[
.bold[.red[Artificial Intelligence (AI):]] .bold[A system that acts in a way that people would call "intelligent" if a human were to do something similar.]
]
]

.pull-right[
<img src="data:image/png;base64,#../../figures/ai_applications.png" width="95%" style="display: block; margin: auto;" />
]

.footnote[
<html>
<hr>
</html>

**Image Source:** The flowchart's content and its LaTeX code were generated using ChatGPT (May 24 Version).
]

---

# Background: The Road to Generative AI

<br>

```r
knitr::include_graphics('../../figures/generative_ai_chart.png')
```

<img src="data:image/png;base64,#../../figures/generative_ai_chart.png" alt="From big data to big models, a flow chart documenting how we got to large language models" width="100%" style="display: block; margin: auto;" />

.footnote[
<html>
<hr>
</html>

**Comment:** You have been hearing about **big data** in SPC for over a decade now. We now have models that can digest and generate answers based on more than 45TB of text.
]

---

# Background: Generative AI

.content-box-gray[
.bold[.red[Generative AI:]] .bold[The objective is to generate new content rather than analyze existing data.]
]

.font90[
- The generated content is based on .bold[.red[stochastic behavior embedded in generative AI models, such that the same input prompt can result in different content]].
- State-of-the-art generative AI models can have up to **540 billion parameters** ([PaLM](https://arxiv.org/abs/2204.02311)).
- With the increase in model size, researchers have observed the **“emergent abilities”** of LLMs, which were **not explicitly encoded in the training**. [Examples include](https://ai.googleblog.com/2022/11/characterizing-emergent-phenomena-in.html):
  + multi-step arithmetic,
  + taking college-level exams, and
  + identifying the intended meaning of a word.
- LLMs are **foundation models** (see [Bommasani et al. 2021](https://arxiv.org/abs/2108.07258)), large pre-trained AI systems that can be **repurposed with minimal effort across numerous domains and diverse tasks.**
]

---

# LLMs: Natural Language Based Coding

Let us break down this prompt with ChatGPT:

.font100[
> I want you to help me use R to create an animated plot of the unemployment rate by state from FRED. Here are the steps that I want you to follow:
> 1. Pull the unemployment rate for all 48 states from "2003-01-01" until "2023-01-08". The symbol will be the two letter state abbreviation + UR (e.g., OHUR).
> 2. Add a column that contains the state name.
> 3. Use a choropleth map, where the unemployment rates are plotting using a sequential color scheme that is colorblind friendly (use RColorBrewer).
> 4. Create the animation, where each frame corresponds to a time interval. The animation should have 12 fps and show the date as part of the title.
> 5. Save the animation as a GIF.
]

---

class: inverse, center, middle

# Recap

---

# Summary of Main Points

By now, you should be able to do the following:

- Describe how and why we use scripted languages in this course.
- Utilize the project workflow in RStudio (we will try to use that as an IDE for
R and Python
). - Understand the syntax, data structures and functions in both
R and Python
.
- Understand the potential impact of LLMs on businesses and explore how they can be leveraged in the context of this class.

---

# 📝 Review and Clarification 📝

1. **Class Notes**: Take some time to revisit your class notes for key insights and concepts.
2. **Zoom Recording**: The recording of today's class will be made available on Canvas approximately 3-4 hours after the session ends.
3. **Questions**: Please don't hesitate to ask for clarification on any topics discussed in class. It's crucial not to let questions accumulate.

---

# 📖 Required Readings 📖

#### 🤖 LLM: Prep

- [AI and the Future of Work in Statistical Quality Control: ChatSQC](https://arxiv.org/pdf/2308.13550.pdf).
  + Read the **abstract**, **Sections 1, 4, and 5**; feel free to skim sections 2-3.
  + Please feel free to test the app at: <https://chatsqc.fsb.miamioh.edu/>.

---

# 🎯 Assignment 🎯

- Complete [Assignment 02](https://miamioh.instructure.com/courses/202961/quizzes/582076/) on Canvas to reinforce your understanding and application of the topics covered today as well as the assigned readings.